<!DOCTYPE html>
<html lang="en">
    <!--
      Licensed to the Apache Software Foundation (ASF) under one or more
      contributor license agreements.  See the NOTICE file distributed with
      this work for additional information regarding copyright ownership.
      The ASF licenses this file to You under the Apache License, Version 2.0
      (the "License"); you may not use this file except in compliance with
      the License.  You may obtain a copy of the License at
          http://www.apache.org/licenses/LICENSE-2.0
      Unless required by applicable law or agreed to in writing, software
      distributed under the License is distributed on an "AS IS" BASIS,
      WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
      See the License for the specific language governing permissions and
      limitations under the License.
    -->
    <head>
        <meta charset="utf-8" />
        <title>GenerateTableFetch</title>

        <link rel="stylesheet" href="../../../../../css/component-usage.css" type="text/css" />
    </head>

    <body>
    	<p>
    		GenerateTableFetch uses its properties and the specified database connection to generate flow files
			containing SQL statements that can be used to fetch "pages" (aka "partitions") of data from a table.
            GenerateTableFetch executes a query to the database to determine the current row count and maximum value,
            and if Maximum Value Columns are specified, will collect the count of rows whose values for the Maximum
            Value Columns are larger than those last observed by GenerateTableFetch. This allows for incremental
            fetching of "new" rows, rather than generating SQL to fetch the entire table each time. If no Maximum
            Value Columns are set, then the processor will generate SQL to fetch the entire table each time.
        </p>
    	
    	<p>
    		In order to generate SQL that will fetch pages/partitions of data, by default GenerateTableFetch will
            generate SQL that orders the data based on the Maximum Value Columns (if present) and utilize the row
            numbers of the result set to determine each page. For example if the Maximum Value Column is an integer "id"
            and the partition size is 10, then the SQL for the first page might be "SELECT * FROM myTable LIMIT 10" and
            the second page might be "SELECT * FROM myTable OFFSET 10 LIMIT 10", and so on.
    	</p>
    	
    	<p>
			Ordering the data can be an expensive operation depending on the database, the number of rows, etc.
            Alternatively, it is possible to specify a column whose values will be used to determine the pages, using
            the Column for Value Partitioning property. If set, GenerateTableFetch will determine the minimum and
            maximum values for the column, and uses the minimum value as the initial offset. The SQL to fetch a page is
            then based on this initial offset and the total difference in values (i.e. maximum - minimum) divided by
            the page size. For example, if the column "id" is used for value partitioning, and the column contains
            values 100 to 200, then with a page size of 10 the SQL to fetch the first page might be "SELECT * FROM
            myTable WHERE id >= 100 AND id < 110" and the second page might be "SELECT * FROM myTable WHERE id >= 110
            AND id < 120", and so on.
        </p>

        <p>
            It is important that the Column for Value Partitioning be set to a column whose type can be coerced to a
            long integer (i.e. not date or timestamp), and that the column values are evenly distributed and not
            sparse, for best performance. As a counterexample to the above, consider a column "id" whose values are 100,
            2000, and 30000. If the Partition Size is 100, then the column values are relatively sparse, so the SQL
            for the "second page" (see above example) will return zero rows, and so will every page until the value in
            the query becomes "id >= 2000". Another counterexample is when the values are not uniformly distributed.
            Consider a column "id" with values 100, 200, 201, 202, ... 299. Then the SQL for the first page (see above
            example) will return one row with value id = 100, and the second page will return 100 rows with values 200
            ... 299. This can cause inconsistent processing times downstream, as the pages may contain a very different
            number of rows. For these reasons it is recommended to use a Column for Value Partitioning that is
            sufficiently dense (not sparse) and fairly evenly distributed.
    	</p>
	</body>
</html>